@orlp
Member

@orlp orlp commented May 30, 2025

Note

TLDR

Categoricals are completely reimplemented to be streaming-compatible and to fit better into the Polars data model. They should generally be faster, more stable, and more reliable. Physical ordering and the String Cache are gone. See #22568 for more context.

Fixes #3036.
Fixes #14247.
Fixes #14996.
Fixes #15293.
Fixes #15781.
Fixes #17479.
Fixes #17643.
Fixes #18065.
Fixes #18501.
Fixes #19868.
Fixes #19943.
Fixes #20290.
Fixes #20318.
Fixes #20364.
Fixes #20562.
Fixes #20878.
Fixes #20931.
Fixes #21175.
Fixes #21583.
Fixes #22448.
Fixes #22586.
Fixes #22664.
Fixes #22830.
Fixes #23015.
Fixes #23071.
Fixes #23289.

This PR essentially replaces the entire Categorical/Enum implementation. Unfortunately, some breakage was unavoidable:

  • Physical ordering for Categoricals has been removed; the ordering is now always lexical. The parameter has been deprecated: passing "physical" as the ordering is not a hard error, it simply no longer has any effect.
  • A new file format for Parquet is introduced. Reading older Parquet files remains backwards-compatible, but new files that contain Enums are read back as Categoricals by older versions of Polars.
  • Casts between Categorical and integer types now always refer to the physical categories. These casts will be deprecated and removed at a later stage, once we have dedicated functions to go to/from categories. Casts to/from String still exist and will remain; all other casts have been removed.

The concept of local and global categories is gone. The StringCache still exists in Python but no longer does anything; it will be deprecated and removed later.

In a future PR we will expose the capabilities of the new Categories system, which lets you specify in the DataType which columns should share the same categorical mapping.

@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars labels May 30, 2025
@orlp orlp force-pushed the cat-rework branch 5 times, most recently from 72307c2 to 863cf09 Compare June 6, 2025 13:37
@orlp orlp force-pushed the cat-rework branch 4 times, most recently from ddb7532 to 9036ef6 Compare July 1, 2025 10:02
@orlp
Member Author

orlp commented Jul 4, 2025

@coastalwhite I addressed most of your concerns, please respond to the others.

dhimmel added a commit to dhimmel/openskistats that referenced this pull request Oct 29, 2025
lorentzenchr added a commit to lorentzenchr/model-diagnostics that referenced this pull request Nov 11, 2025
Labels

highlight Highlight this PR in the changelog internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars